swev-id: scikit-learn__scikit-learn-13124 StratifiedKFold shuffle fix #49
casey-brooks wants to merge 2 commits into scikit-learn__scikit-learn-13124 from
noa-lucent
left a comment
Fix looks good, but the regression coverage does not hit the original failure. Please address the feedback.
    kf2 = StratifiedKFold(5, shuffle=True, random_state=1)
    test_set1 = sorted([tuple(s[1]) for s in kf1.split(X, y)])
    test_set2 = sorted([tuple(s[1]) for s in kf2.split(X, y)])
    assert test_set1 != test_set2
[major] The new regression check never exercises the original failure. On the buggy branch, the two differently seeded splitters (kf1 and kf2) already yield different split assignments, so this assertion passes even without the fix. The behavioral bug was that, within a single seed, the per-class permutations were coordinated: equal-sized strata received identical permutations. We need a test that fails prior to the fix by verifying, for example, that for a balanced label vector the sequence of class-0 permutations differs from the sequence of class-1 permutations when shuffle=True and random_state is an int. Could you rewrite this regression test to capture that scenario (and keep the seed-vs-seed check if you still find it valuable)?
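One possible shape for such a per-class check, offered as a sketch only. It assumes scikit-learn and NumPy are installed; the helper names per_class_test_positions and classes_shuffled_independently are hypothetical and not taken from the PR. For a balanced two-class label vector it records, fold by fold, which within-class positions land in the test set, then requires that at least one integer seed gives class 0 and class 1 different sequences. Under coordinated per-class permutations the two sequences would match for every seed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def per_class_test_positions(seed, n_per_class=10, n_splits=5):
    """Return the per-fold within-class test positions for class 0 and class 1.

    Hypothetical helper: labels are [0]*n_per_class + [1]*n_per_class, so
    class-1 indices are shifted by n_per_class to make them comparable
    with class-0 indices.
    """
    y = np.repeat([0, 1], n_per_class)
    X = np.zeros((y.size, 1))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    class0, class1 = [], []
    for _, test_idx in skf.split(X, y):
        in_class0 = test_idx < n_per_class
        class0.append(tuple(int(i) for i in test_idx[in_class0]))
        class1.append(tuple(int(i) - n_per_class for i in test_idx[~in_class0]))
    return class0, class1


def classes_shuffled_independently(seeds=range(10)):
    """True if, for at least one seed, the two classes are shuffled differently."""
    for seed in seeds:
        class0, class1 = per_class_test_positions(seed)
        if class0 != class1:
            return True
    return False


assert classes_shuffled_independently()
```

Checking several seeds rather than one keeps the assertion from depending on a single lucky or unlucky draw, while still failing whenever every seed produces identical per-class shuffles.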
noa-lucent
left a comment
Thanks for the update—the per-class regression test now catches the original failure, and the rest of the changes look solid. 👍
Summary
[...] shuffle=True.
[...] check_random_state(self.random_state) to derive per-class RNG seeds so equal-sized classes do not receive coordinated permutations.
[...] StratifiedKFold to clarify shuffle behavior; add regression tests.
Reproduction Steps
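A minimal NumPy-only sketch of the mechanism described in the summary, not the PR's actual code or the original reproduction script. The helper per_class_permutations and its share_seed flag are hypothetical. It contrasts reusing the same integer seed for every class with deriving a separate seed per class from one seeded generator, which is one plausible reading of "derive per-class RNG seeds".

```python
import numpy as np


def per_class_permutations(class_sizes, seed, share_seed):
    """Return one permutation per class (hypothetical helper for illustration)."""
    if share_seed:
        # Coordinated case: every class re-seeds from the same integer, so
        # equal-sized classes receive exactly the same permutation.
        return [np.random.RandomState(seed).permutation(n) for n in class_sizes]
    # Independent case (assumption): draw one child seed per class from a
    # single generator seeded once, then permute each class with its own seed.
    master = np.random.RandomState(seed)
    child_seeds = master.randint(0, np.iinfo(np.int32).max, size=len(class_sizes))
    return [
        np.random.RandomState(int(s)).permutation(n)
        for s, n in zip(child_seeds, class_sizes)
    ]


coordinated = per_class_permutations([10, 10], seed=0, share_seed=True)
independent = per_class_permutations([10, 10], seed=0, share_seed=False)

# Equal by construction: both classes used the same seed and the same size.
print(np.array_equal(coordinated[0], coordinated[1]))
# Expected to differ, since each class drew from its own derived seed.
print(np.array_equal(independent[0], independent[1]))
```

Both variants remain reproducible for a fixed integer seed; only the relationship between the classes changes.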
Observed Failure (pre-fix)
[...] True for both equality checks, indicating coordinated permutations across classes (identical within-stratum shuffles) even with different seeds. There is no exception; the failure is behavioral and reproducibility-related.
Fix and Post-Fix Behavior
test_stratifiedkfold_shuffle_independent_per_class_permutation
test_stratifiedkfold_shuffle_different_seeds_change_permutations
test_stratifiedkfold_no_shuffle_unchanged
Local Test Logs
Notes
[...] shuffle=True and random_state is an integer, to align with documentation and expected semantics (shuffle within each stratification independently).